Omni-Dimensional Dynamic Convolution
Learning a single static convolutional kernel in each convolutional layer is
the common training paradigm of modern Convolutional Neural Networks (CNNs).
Instead, recent research in dynamic convolution shows that learning a linear
combination of convolutional kernels weighted with their input-dependent
attentions can significantly improve the accuracy of light-weight CNNs, while
maintaining efficient inference. However, we observe that existing works endow
convolutional kernels with the dynamic property through one dimension
(regarding the convolutional kernel number) of the kernel space, but the other
three dimensions (regarding the spatial size, the input channel number and the
output channel number for each convolutional kernel) are overlooked. Inspired
by this, we present Omni-dimensional Dynamic Convolution (ODConv), a more
generalized yet elegant dynamic convolution design, to advance this line of
research. ODConv leverages a novel multi-dimensional attention mechanism with a
parallel strategy to learn complementary attentions for convolutional kernels
along all four dimensions of the kernel space at any convolutional layer. As a
drop-in replacement for regular convolutions, ODConv can be plugged into many
CNN architectures. Extensive experiments on the ImageNet and MS-COCO datasets
show that ODConv brings solid accuracy boosts for various prevailing CNN
backbones, including both light-weight and large ones, e.g., absolute top-1
improvements of 3.77%~5.71% for the MobileNetV2 family and 1.86%~3.72% for the
ResNet family on the ImageNet dataset. Intriguingly, thanks to its improved feature
learning ability, ODConv with even one single kernel can compete with or
outperform existing dynamic convolution counterparts with multiple kernels,
substantially reducing extra parameters. Furthermore, ODConv is also superior
to other attention modules for modulating the output features or the
convolutional weights.
Comment: Spotlight paper at ICLR 2022. Code and models are available at
https://github.com/OSVAI/ODCon
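To make the four-dimensional attention concrete, below is a minimal PyTorch sketch of the idea: four parallel attention branches (spatial, input-channel, output-channel, and kernel-wise) modulate a bank of candidate kernels, which is then aggregated per sample. The class name ODConv2dSketch, the squeeze head, and the reduction ratio are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ODConv2dSketch(nn.Module):
    """Illustrative four-dimensional dynamic convolution (not the official code)."""
    def __init__(self, in_ch, out_ch, k=3, num_kernels=4, reduction=16):
        super().__init__()
        self.k, self.num_kernels = k, num_kernels
        # A bank of `num_kernels` candidate kernels to aggregate per sample.
        self.weight = nn.Parameter(torch.randn(num_kernels, out_ch, in_ch, k, k) * 0.02)
        hidden = max(in_ch // reduction, 4)
        self.squeeze = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                     nn.Linear(in_ch, hidden), nn.ReLU(inplace=True))
        # One attention head per kernel-space dimension (the "parallel strategy").
        self.attn_spatial = nn.Linear(hidden, k * k)
        self.attn_in = nn.Linear(hidden, in_ch)
        self.attn_out = nn.Linear(hidden, out_ch)
        self.attn_kernel = nn.Linear(hidden, num_kernels)

    def forward(self, x):
        b, c, h, w = x.shape
        z = self.squeeze(x)                                   # (b, hidden)
        a_s = torch.sigmoid(self.attn_spatial(z)).view(b, 1, 1, 1, self.k, self.k)
        a_i = torch.sigmoid(self.attn_in(z)).view(b, 1, 1, c, 1, 1)
        a_o = torch.sigmoid(self.attn_out(z)).view(b, 1, -1, 1, 1, 1)
        a_k = torch.softmax(self.attn_kernel(z), dim=1).view(b, self.num_kernels, 1, 1, 1, 1)
        # Modulate the kernel bank along all four dimensions, then sum over kernels.
        w_dyn = (a_k * a_o * a_i * a_s * self.weight.unsqueeze(0)).sum(dim=1)
        # Grouped-conv trick: fold the batch into groups so each sample
        # is convolved with its own dynamically aggregated kernel.
        out = F.conv2d(x.reshape(1, b * c, h, w),
                       w_dyn.reshape(-1, c, self.k, self.k),
                       padding=self.k // 2, groups=b)
        return out.view(b, -1, h, w)
```

In this reading, setting num_kernels=1 still leaves the three per-kernel attentions active, which is one way to interpret the single-kernel result reported above.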
Efficient N:M Sparse DNN Training Using Algorithm, Architecture, and Dataflow Co-Design
Sparse training is one of the promising techniques to reduce the
computational cost of DNNs while retaining high accuracy. In particular, N:M
fine-grained structured sparsity, where only N out of every M consecutive elements
can be nonzero, has attracted attention due to its hardware-friendly pattern
and capability of achieving a high sparse ratio. However, the potential to
accelerate N:M sparse DNN training has not been fully exploited, and there is a
lack of efficient hardware supporting N:M sparse training. To tackle these
challenges, this paper presents a computation-efficient training scheme for N:M
sparse DNNs using algorithm, architecture, and dataflow co-design. At the
algorithm level, a bidirectional weight pruning method, dubbed BDWP, is
proposed to leverage the N:M sparsity of weights during both forward and
backward passes of DNN training, which can significantly reduce the
computational cost while maintaining model accuracy. At the architecture level,
a sparse accelerator for DNN training, namely SAT, is developed to neatly
support both the regular dense operations and the computation-efficient N:M
sparse operations. At the dataflow level, multiple optimization methods,
including interleave mapping, pre-generation of N:M sparse weights, and offline
scheduling, are proposed to boost the computational efficiency of SAT. Finally,
the effectiveness of our training scheme is evaluated on a Xilinx VCU1525 FPGA
card using various DNN models and datasets. Experimental results show that the
SAT accelerator with the BDWP sparse training method under a 2:8 sparse ratio
achieves an average speedup of 1.75x over dense training,
accompanied by a negligible accuracy loss of 0.56% on average. Furthermore, our
proposed training scheme significantly improves the training throughput by
2.97x~25.22x and the energy efficiency by 1.36x~3.58x over prior FPGA-based
accelerators.
Comment: To appear in the IEEE Transactions on Computer-Aided Design of
Integrated Circuits and Systems (TCAD)
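As a toy illustration of the N:M pattern the paper builds on, the sketch below keeps the N largest-magnitude weights in every group of M consecutive elements and zeroes the rest. The helper nm_prune is hypothetical, not the paper's BDWP code; the last lines only gesture at the "bidirectional" aspect, where a mask is also derived for the backward pass.

```python
import torch

def nm_prune(weight: torch.Tensor, n: int = 2, m: int = 8) -> torch.Tensor:
    """Keep the n largest-magnitude entries in each group of m consecutive weights."""
    flat = weight.reshape(-1, m)                       # group consecutive elements
    idx = flat.abs().topk(n, dim=1).indices            # survivors per group
    mask = torch.zeros_like(flat).scatter_(1, idx, 1.0)
    return (flat * mask).reshape(weight.shape)

w = torch.randn(4, 16)
w_fwd = nm_prune(w, n=2, m=8)                          # 2:8 pattern, 75% zeros
assert (w_fwd.reshape(-1, 8) != 0).sum(dim=1).max() <= 2
# The "bidirectional" idea: the backward pass also wants N:M sparse operands,
# so a mask is additionally derived for the transposed weight (illustrative only).
w_bwd = nm_prune(w.t().contiguous(), n=2, m=8)
```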
Not All Features Matter: Enhancing Few-shot CLIP with Adaptive Prior Refinement
The popularity of Contrastive Language-Image Pre-training (CLIP) has
propelled its application to diverse downstream vision tasks. To improve its
capacity on downstream tasks, few-shot learning has become a widely-adopted
technique. However, existing methods either exhibit limited performance or
suffer from excessive learnable parameters. In this paper, we propose APE, an
Adaptive Prior rEfinement method for CLIP's pre-trained knowledge, which
achieves superior accuracy with high computational efficiency. Via a prior
refinement module, we analyze the inter-class disparity in the downstream data
and decouple the domain-specific knowledge from the CLIP-extracted cache model.
On top of that, we introduce two model variants, a training-free APE and a
training-required APE-T. We explore the trilateral affinities between the test
image, prior cache model, and textual representations, and only enable a
lightweight category-residual module to be trained. For the average accuracy
over 11 benchmarks, both APE and APE-T attain state-of-the-art results,
respectively outperforming the second-best method by +1.59% and +1.99% under 16
shots with 30x fewer learnable parameters.
Comment: Code is available at https://github.com/yangyangyang127/AP
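A hedged sketch of what a training-free, cache-based classifier in the spirit of APE might look like: blend CLIP's zero-shot text logits with affinities to a few-shot cache, computed only on a refined subset of feature channels. The channel-selection criterion (inter-class variance over text features), the function name ape_like_logits, and the alpha/beta values are illustrative assumptions, not the paper's exact refinement formula.

```python
import torch
import torch.nn.functional as F

def ape_like_logits(test_feat, text_feats, cache_keys, cache_labels,
                    num_refined=512, alpha=1.0, beta=5.0):
    """test_feat: (d,); text_feats: (C, d); cache_keys: (S, d) few-shot image
    features; cache_labels: (S, C) one-hot. Features assumed L2-normalized,
    with num_refined <= d."""
    # Assumed refinement criterion: keep the channels where class text
    # embeddings disagree most (a proxy for inter-class disparity).
    keep = text_feats.var(dim=0).topk(num_refined).indices
    zero_shot = 100.0 * test_feat @ text_feats.t()          # standard CLIP logits
    affinity = test_feat[keep] @ cache_keys[:, keep].t()    # (S,) cache similarities
    cache_logits = (beta * (affinity - 1)).exp() @ cache_labels.float()
    return zero_shot + alpha * cache_logits

d, C, S = 1024, 11, 176                                    # e.g. 11 classes, 16 shots
logits = ape_like_logits(F.normalize(torch.randn(d), dim=0),
                         F.normalize(torch.randn(C, d), dim=1),
                         F.normalize(torch.randn(S, d), dim=1),
                         torch.eye(C).repeat(16, 1))
```

APE-T would additionally train a small category-residual module on top of this pipeline; the training-free variant above uses no learnable parameters at all.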
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
We present LLaMA-Adapter, a lightweight adaption method to efficiently
fine-tune LLaMA into an instruction-following model. Using 52K self-instruct
demonstrations, LLaMA-Adapter only introduces 1.2M learnable parameters upon
the frozen LLaMA 7B model, and costs less than one hour for fine-tuning on 8
A100 GPUs. Specifically, we adopt a set of learnable adaption prompts, and
prepend them to the input text tokens at higher transformer layers. Then, a
zero-init attention mechanism with zero gating is proposed, which adaptively
injects the new instructional cues into LLaMA while effectively preserving its
pre-trained knowledge. With efficient training, LLaMA-Adapter generates
high-quality responses comparable to those of Alpaca, which fully fine-tunes
all 7B parameters. Furthermore, our approach can be simply extended to multi-modal
input, e.g., images, for image-conditioned LLaMA, which achieves superior
reasoning capacity on ScienceQA. We release our code at
https://github.com/ZrrSkywalker/LLaMA-Adapter.
Comment: Work in Progress. Code is available at
https://github.com/ZrrSkywalker/LLaMA-Adapte
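The zero-init attention idea can be sketched in a few lines of PyTorch: scores against the learnable adaption prompts are normalized separately and scaled by a gate initialized to zero, so at the start of training the module reproduces the frozen model's attention exactly. The single-head formulation, omitted causal mask, and class name are simplifying assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class ZeroInitPromptAttention(nn.Module):
    """Single-head sketch; causal masking and multi-head splitting omitted."""
    def __init__(self, dim, num_prompts=10):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        self.gate = nn.Parameter(torch.zeros(1))      # zero gating: no effect at init

    def forward(self, q, k, v):
        d = q.shape[-1]
        scores_tok = q @ k.transpose(1, 2) / d ** 0.5             # (b, s, s)
        scores_pr = q @ self.prompts.t().unsqueeze(0) / d ** 0.5  # (b, s, p)
        attn_tok = torch.softmax(scores_tok, dim=-1)
        # Prompt scores are normalized separately, then scaled by the learnable gate.
        attn_pr = torch.tanh(self.gate) * torch.softmax(scores_pr, dim=-1)
        return attn_tok @ v + attn_pr @ self.prompts.unsqueeze(0)

attn = ZeroInitPromptAttention(dim=64)
q = k = v = torch.randn(2, 8, 64)
out = attn(q, k, v)   # equals plain attention at initialization, since tanh(0) = 0
```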
Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification
Recent progress in large language models (LLMs) like GPT-4 and PaLM-2 has
brought significant advancements in addressing math reasoning problems. In
particular, OpenAI's latest version of GPT-4, known as GPT-4 Code Interpreter,
shows remarkable performance on challenging math datasets. In this paper, we
explore the effect of code on enhancing LLMs' reasoning capability by
introducing different constraints on the Code Usage Frequency of GPT-4
Code Interpreter. We found that its success can be largely attributed to its
powerful skills in generating and executing code, evaluating the output of code
execution, and rectifying its solution when receiving unreasonable outputs.
Based on this insight, we propose a novel and effective prompting method,
explicit code-based self-verification (CSV), to further
boost the mathematical reasoning potential of GPT-4 Code Interpreter. This
method employs a zero-shot prompt on GPT-4 Code Interpreter to encourage it to
use code to self-verify its answers. When the verification state registers as
"False", the model automatically amends its solution, analogous to how we
rectify errors during a mathematics
examination. Furthermore, we recognize that the states of the verification
result indicate the confidence of a solution, which can improve the
effectiveness of majority voting. With GPT-4 Code Interpreter and CSV, we
achieve an impressive zero-shot accuracy on the MATH dataset (53.9% → 84.3%).
Comment: Solving Challenging Math Word Problems Using GPT-4 Code Interpreter
with Code-based Self-Verification
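The confidence-weighted voting step can be illustrated schematically: each sampled solution carries a verification state, and votes are weighted by that state before taking the majority. The specific weights and the helper weighted_vote below are hypothetical placeholders, not the paper's tuned values.

```python
from collections import Counter

# Assumed weights for each verification state; treat the numbers as placeholders.
STATE_WEIGHT = {"True": 1.0, "Uncertain": 0.5, "False": 0.1}

def weighted_vote(samples):
    """samples: list of (final_answer, verification_state) pairs from sampled solutions."""
    scores = Counter()
    for answer, state in samples:
        scores[answer] += STATE_WEIGHT.get(state, 0.0)
    return scores.most_common(1)[0][0]

# Three sampled solutions to one problem; the verified answer wins the vote.
print(weighted_vote([("42", "True"), ("41", "False"), ("42", "Uncertain")]))  # 42
```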
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model
How to efficiently transform large language models (LLMs) into instruction
followers has recently become a popular research direction, while training LLMs
for multi-modal reasoning remains less explored. Although the recent LLaMA-Adapter
demonstrates the potential to handle visual inputs with LLMs, it still cannot
generalize well to open-ended visual instructions and lags behind GPT-4. In
this paper, we present LLaMA-Adapter V2, a parameter-efficient visual
instruction model. Specifically, we first augment LLaMA-Adapter by unlocking
more learnable parameters (e.g., norm, bias and scale), which distribute the
instruction-following ability across the entire LLaMA model besides adapters.
Secondly, we propose an early fusion strategy to feed visual tokens only into
the early LLM layers, contributing to better visual knowledge incorporation.
Thirdly, a joint training paradigm of image-text pairs and
instruction-following data is introduced by optimizing disjoint groups of
learnable parameters. This strategy effectively alleviates the interference
between the two tasks of image-text alignment and instruction following and
achieves strong multi-modal reasoning with only a small-scale image-text and
instruction dataset. During inference, we incorporate additional expert models
(e.g., captioning/OCR systems) into LLaMA-Adapter to further enhance its image
understanding capability without incurring training costs. Compared to the
original LLaMA-Adapter, our LLaMA-Adapter V2 can perform open-ended multi-modal
instructions by merely introducing 14M parameters over LLaMA. The newly
designed framework also exhibits stronger language-only instruction-following
capabilities and even excels in chat interactions. Our code and models are
available at https://github.com/ZrrSkywalker/LLaMA-Adapter.
Comment: Code and models are available at
https://github.com/ZrrSkywalker/LLaMA-Adapte
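The "unlocking more learnable parameters" step admits a simple sketch: freeze the whole model, then re-enable gradients only for biases, normalization parameters, and adapter weights. The name-matching heuristics and the function name unlock_bias_norm_adapter below are assumptions about common PyTorch naming, not the authors' exact parameter selection (their added scale factors, for instance, would need their own tag).

```python
import torch.nn as nn

def unlock_bias_norm_adapter(model: nn.Module, adapter_tag: str = "adapter") -> int:
    """Freeze everything, then re-enable biases, norm parameters, and adapters.
    Returns the number of parameters left trainable."""
    trainable = 0
    for name, p in model.named_parameters():
        p.requires_grad = ("bias" in name or "norm" in name.lower()
                           or adapter_tag in name)
        trainable += p.numel() if p.requires_grad else 0
    return trainable

# On a toy transformer layer, only LayerNorm parameters and biases remain trainable.
layer = nn.TransformerEncoderLayer(d_model=64, nhead=4)
print(unlock_bias_norm_adapter(layer), "trainable parameters")
```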